Chapter 4: Clustering & Classification

Data wrangling

Not included, as this time it is listed as the final assignment and deals with next week’s data. This has been completed; create_human.R” is available in the repo under Scripts.

Installation of the necessary packages

require(tidyverse);require(here);require(MASS);require(corrplot); require(GGally)

Analysis

Data overview

  • First, we load the Boston dataset and check its dimensions & structure.
data(Boston)

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14
  • Overall the dimensions are 506 x 14.

  • All of the 14 variables appear numerical (integer or double).

  • Next, we look up what each of the 14 vars represents.

  • The dataset has an official documentation.

    • Input command “?Boston” to view.
    • In the documentation, variable definitions are provided.
  • The data contains information on housing in the Boston region, USA.

  • Included variables are:

    • CRIM - per capita crime rate by town
    • ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
    • INDUS - proportion of non-retail business acres per town.
    • CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
    • NOX - nitric oxides concentration (parts per 10 million)
    • RM - average number of rooms per dwelling
    • AGE - proportion of owner-occupied units built prior to 1940
    • DIS - weighted distances to five Boston employment centres
    • RAD - index of accessibility to radial highways
    • TAX - full-value property-tax rate per $10,000
    • PTRATIO - pupil-teacher ratio by town
    • B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
    • LSTAT - % lower status of the population
    • MEDV - Median value of owner-occupied homes in $1000’s
  • Next, graphical and table format summaries are generated for the data

  • First, summary as a table

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
  • The summary values for each var are wildly different. They are all on different scales, despite being all numeric. This is why standardization over vars will soon be done, to make them comparable.
ggpairs(Boston)

  • There’s quite a lot of options on what to look at here. I’m going to cherry pick some findings, instead of going through every variable.

  • Multiple variables have skewed distributions. For example:

    • Age skews strongly towards high values -> mainly old buildings (built before -1940)
    • Variable dis (distance to employment centers) has a skew towards low values (distance to employment centers commonly small)
  • Also, several variables have bimodality (= more than 1 peak in the distribution, meaning that 2 values are more common than others)

  • Scatterplot Indus x NOx appears to show 2 distinct groupings? For most data points, both NOX concentrations and the share of industrial acreage are low (these have a strong, statistically significant correlation coeff too!)

  • Crime rate appears to have a statistically significant correlation with almost all of these vars. Seems to correlate positively with the proportion of industrial acreage, NOX concentrations, large residential land areas…

  • The distribution of “indus” (business acreage) shows bimodality: we have two peaks, indicating that a couple of values are considerably more common than others. This variable also correlates w. high statistical significance with NOX emissions, which makes sense as the variable represents the prevalence of business acreage like industry.

  • Distribution of NOX is strongly skewed towards small values.

  • Age skews strongly towards high values. Overall, most construction in the regions of the data was done prior to 1940.

  • Again, we see bimodality in property tax rates. Low and high ends of the spectrum have clear peaks.

Dataset standardization

  • As explained above, all these variables are numeric but have wildly different measurement scales. Hence, standardization.
  • Let’s print summaries of both standardized and non-standardized data and compare
boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
  • This procedure changed the scales on which all the different vars are measured. Previously, they were all different as they describe very different things. Now, we have “forced” all of them to a similar scale. See for example how the scale of “black” changes: max was almost 400, way more than for any other var. Due to standardization, it became 0.44. All the vars are now on the same scale.

Categorigal crime variable creation

  • Division to bins according to quantiles, set it as a new variable to the old frame
bins <- quantile(boston_scaled$crim) 

# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))


boston_scaled <- data.frame(boston_scaled, crime)

boston_scaled$crim<-NULL

*Dividing the data to training set and testing set

n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n,  size = nrow(boston_scaled) * 0.8)

# create train set and test set
train <- boston_scaled[ind,] #Randomly selects 80% of rows, index numbers
test <- boston_scaled[-ind,] #everything except the indices in the training set

LDA

  • Fitting LDA on the training data, then plotting it.
  • Here’ we seek to find a linear combination of variables that best separates the data into groups.
  • Looking at vector directions & magnitudes tells us about which LD it has the most effect on
  • Variables’ relationships with each other can also be seen. Parallel vectors show correlation, while perpendicular directions are the opposite case.
lda.fit <- lda(crime ~ . , data = train)


lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results (select both lines and execute them at the same time!)
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale =2)

  • RAD has the largest impact on LD1. For LD2 it lookse to be either Nox or zn, albeit the magnitude appears similar…
  • Nox and rad, as well as zn and rad, are almost perpendicular with each other (like a 90 degree angle) -> no correlation.
  • Most of the other vars cluster very close together. None of them have vector magnitudes even close to those seen for zn, rad and nox.

Testing model prediction power

  • First, we save the “real” class values of the test set, and remove it from the frame.
  • We will use these to compare how well the same model we used on the training data classifies the actual test data-
correct_classes<-test$crime
test$crime<-NULL


# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results

table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       17       9        4    0
##   med_low    7      17        7    0
##   med_high   0       6       15    2
##   high       0       0        0   18
  • 30 correct “lows”, only 7 of which were correctly classified. 23% success rate…
  • 25 med-lows, of which 12 are correctly classified. 48% success rate, better but still not good enough.
  • 26 med-highs, just 9 correctly placed. 34 %…
  • 21 highs, of which 19 got correctly placed. A 90% success rate for this category -> high crime rates appear to be predicted very accurately, while in all other categories the performance is very lackluster.

K-Means

*Clearing .env, reload Boston raw data and scale it

data(Boston)

Boston_scaled<-as.data.frame(scale(Boston))

*Calculating distances between data points

dist_eu <- dist(Boston_scaled)

Run K-means clustering algorithm on the scaled data here, we run it on 6 centers as optimization will follow.

# k-means clustering
km <- kmeans(Boston_scaled, centers = "6")
pairs(Boston_scaled[1:6], col=km$cluster)

*Quite a confusing table…K of 6 is arguably too many.

*Optimizing K

  • In the optimization, we look at the WCSS (Within group sum of squares)
  • A high WCSS indicates that even though we have classified data into a same cluster group, there would still exist considerable variation within that group.
  • We want to find a K value, for which the variation within each group is as small as possible (i.e. the clusters should be formed with the idea of grouping similar data points together, with as little “wild” variation between points in the same group as possible)
  • Visually, we check the plotted curve, and observe at which point (K value) the most “benefit” has been gained from increasing K.
  • The slope of WCSS is steep and rapidly decreases up to K-level of circa 2. After that, we slope grows considerably milder -> even if we increase K, the resulting “gains” (or here, losses, as we want small WCSS values) grow less and less pronounced.
    • So, after K= circa 2, not really worth it to add any more centers…
# Work with the exercise in this chunk, step-by-step. Fix the R code!
# MASS, ggplot2 and Boston dataset are available
set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston_scaled, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Re-running analysis with K set to 2

km <- kmeans(Boston_scaled, centers = 2)

pairs(Boston_scaled[1:5], col = km$cluster)

pairs(Boston_scaled[6:10], col = km$cluster)

  • With K at 2, it’s a lot easier to see what’s happening.
  • Let’s zoom into some of these:
  • Multiple distinct groupings. For example:
    • Black data points in the tax - age pairplot: cluster of properties high in both tax rate and age
pairs(Boston_scaled[c(1,10)], col = km$cluster)

  • Crime rates and high property tax values (wealth indicator?) appear to cluster separately. Wealthier neighbourhoods, less crime?
pairs(Boston_scaled[c(3,5)], col = km$cluster)

  • Industrial zoning and Nitrogen oxide emissions cluster as well. A distinct grouping of black data points show high emissions, when the rate of industrial zoning is also high. And when the zoning is not industrial, emissions are also low.
pairs(Boston_scaled[c(7,14)], col = km$cluster)

  • Housing value and age appear to show a large cluster in black, with high age (old housing) and low value.
pairs(Boston_scaled[c(1,14)], col = km$cluster)

Bonus task

  • Load original data, standardize, and run K means with 3 clusters
data("Boston")

Boston_scaled<-as.data.frame(scale(Boston))


km <- kmeans(Boston_scaled, centers = "5")
  • Then, fit LDA with clusters as the target (dependent)
lda.fit <- lda(km$cluster ~ . , data = Boston_scaled)

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(km$cluster)

# plot the lda results (select both lines and execute them at the same time!)
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale =2.4)

#More zoom
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale =20)

  • Now, location on the riverside looks to be the main determinant for group 1. For 3, it appears to be accessibility to radial highways. Remaining variables cluster together in a single mass, with no single ones standing out.
  • In the more zoomed plot, LSTAT looks to be important for group 3, while zn is important for 2.

Super Bonus

  • Install plotly
library(plotly)
## Warning: package 'plotly' was built under R version 4.2.3
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

*Calculate matrix products

lda.fit <- lda(crime ~ . , data = train)

model_predictors <- dplyr::select(train, -crime)

dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
  • 3D Plot

  • Colored by the crime classes of the training data, we get the pink “High crime” data points grouped on their own (some overlap with med-high, though)

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color= train$crime)

Data wrangling for next week

  • This has been completed; create_human.R” is available in the repo under Scripts and its output under Data.